TL/CUDA: Linear Broadcast for GPU #948

ikryukov · 2024-03-22T14:30:30Z

What

Linear CUDA Broadcast implementation.

Why?

Functional improvement and parity with other communication libraries.
Ability to place many ranks on a single GPU.
No GPU blocking: communication is initiated from the host.

How?

Naive approach: the root rank writes data to its own shared buffer, and the other ranks read from it through NVLink.
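
To make the data flow above concrete, here is a minimal host-only sketch in plain C. It is not the code from this PR: `staging_t` and `linear_bcast_sim` are hypothetical names, `memcpy` stands in for the device copies over NVLink, and the two-slot staging area is an assumption loosely based on the "double buffering" commit. A real implementation would also need synchronization between the root and peer stages, which is only marked with a comment here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_SLOTS 2 /* assumption: two staging slots (double buffering) */

/* Hypothetical stand-in for the root's shared staging buffer. */
typedef struct {
    char  *slots[NUM_SLOTS]; /* each slot holds up to slot_size bytes */
    size_t slot_size;
} staging_t;

/*
 * Simulate a linear broadcast of `len` bytes from `src` to every buffer in
 * `dsts`, one chunk per step. Returns the number of steps that were needed.
 */
static size_t linear_bcast_sim(const staging_t *st, const char *src,
                               char **dsts, size_t num_peers, size_t len)
{
    size_t steps = 0;
    size_t off;

    assert(st->slot_size > 0);
    for (off = 0; off < len; off += st->slot_size, steps++) {
        size_t chunk = (len - off < st->slot_size) ? len - off
                                                   : st->slot_size;
        char  *slot  = st->slots[steps % NUM_SLOTS];
        size_t p;

        /* Root stage: copy the next chunk into the root's own staging slot. */
        memcpy(slot, src + off, chunk);

        /* A real implementation would synchronize root and peers here. */

        /* Peer stage: every non-root rank reads the same slot. */
        for (p = 0; p < num_peers; p++) {
            memcpy(dsts[p] + off, slot, chunk);
        }
    }
    return steps;
}
```

The approach is "linear" in the sense that every peer reads directly from the root's buffer, so the root's staging area is the single source for all ranks. Alternating between slots would let the root fill one slot while peers are still reading the other, although this single-threaded sketch runs the stages back to back.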

ikryukov · 2024-03-22T14:32:07Z

Configuration string:
./configure --with-ucx=$HPCX_UCX_DIR --with-cuda=/usr/local/cuda --with-mpi=$HPCX_MPI_DIR --enable-gtest --prefix=$PWD/install --with-nvcc-gencode="-gencode=arch=compute_80,code=sm_80" --enable-debug
Run string:
mpirun --mca coll ^hcoll --mca coll_ucc_enable 0 -x LD_LIBRARY_PATH=/home/ikryukov/work/ucc/install/lib:$LD_LIBRARY_PATH -x UCC_CLS=basic -x UCC_TLS=ucp,cuda -x xUCC_LOG_LEVEL=info -x UCC_TL_CUDA_LOG_LEVEL=debug -x UCC_LOG_LEVEL=info -x UCC_CONFIG_FILE= -np 2 ./install/bin/ucc_test_mpi -c bcast --teams world -M cuda -O 0 -S 2

swx-jenkins3 · 2024-03-22T14:33:43Z

Can one of the admins verify this patch?

samnordmann

Looks good to me! Thanks!
I only left some minor remarks. Can you, in addition, add this algo to the tests?

samnordmann · 2024-08-19T16:34:53Z

src/components/tl/cuda/bcast/bcast_linear.c

+                    return;
+                }
+            } else {
+                ucc_debug("etask is nullptr");


Isn't this case an infinite loop? I'm not sure I understand.

ikryukov · 2024-08-21

It is an error case that I used for debugging; it should not happen in real situations.

samnordmann · 2024-09-09

OK, why not use ucc_assert here then?
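
The suggestion in this thread can be illustrated with a small sketch. It uses the standard C `assert` as a stand-in for `ucc_assert`, and `task_t` and `poll_task` are hypothetical names, not taken from the PR. The idea is that a state which should never occur fails fast in builds with assertions enabled, instead of being logged on every poll while the caller keeps retrying.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the executor task polled by the progress loop. */
typedef struct {
    int done;
} task_t;

/* Returns 1 when the task has completed, 0 when the caller should poll again. */
static int poll_task(const task_t *etask)
{
    /*
     * A NULL task is treated as a programming error: abort here rather than
     * log a message and return "not done", which the caller would retry
     * indefinitely.
     */
    assert(etask != NULL);
    return etask->done;
}
```

Note that the standard `assert` compiles to nothing when `NDEBUG` is defined, so an error that can occur in production would still need an explicit error return.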

ikryukov · 2024-08-26T11:27:41Z

> Looks good to me! Thanks! I only left some minor remarks. Can you, in addition, add this algo to the tests?

Thanks for the review. I addressed the comments and added a test to validate bcast for CUDA too.

janjust

looks good

manjugv · 2024-09-18T17:12:28Z

ping @Sergei-Lebedev

ikryukov marked this pull request as draft March 22, 2024 14:32

ikryukov force-pushed the cuda_bcast branch from 7c22cb5 to c9e7048 on April 12, 2024 13:42

ikryukov force-pushed the cuda_bcast branch 3 times, most recently from 00f3922 to 6caea67 on July 4, 2024 15:38

ikryukov force-pushed the cuda_bcast branch from e6f4223 to f6d7536 on August 2, 2024 15:54

ikryukov marked this pull request as ready for review August 2, 2024 16:17

Sergei-Lebedev requested review from janjust, Sergei-Lebedev and samnordmann August 14, 2024 09:14

Sergei-Lebedev added the Ready-for-Review label Aug 14, 2024

samnordmann reviewed Aug 19, 2024

ikryukov force-pushed the cuda_bcast branch from 11c7311 to 99264d8 on August 23, 2024 13:17

Sergei-Lebedev reviewed Aug 30, 2024

samnordmann self-requested a review September 9, 2024 08:16

samnordmann approved these changes Sep 9, 2024

janjust approved these changes Sep 9, 2024

ikryukov added 9 commits October 28, 2024 11:35

TL/CUDA: add linear bcast

999b43e

TL/CUDA: fix build

fceeb8f

TL/CUDA: wip

dc65324

TL/CUDA: fix compilation

7437518

TL/CUDA: calc size

62157cd

TL/CUDA: wip some logic for root

a0f225e

TL/CUDA: wip logic for client

4187516

TL/CUDA: added barrier to sync stages

3f0e4d9

TL/CUDA: non zero root

cd13b04

ikryukov added 17 commits October 28, 2024 11:35

TL/CUDA: revert commented

54b84dc

TL/CUDA: wip multistep

7989e60

TL/CUDA: fix step check

e400efd

TL/CUDA: minor cleanup

2c94cbf

TL/CUDA: removed breaks

3ff154b

TL/CUDA: fix linter

1626808

TL/CUDA: double buffering

bb555de

TL/CUDA: moved get/set rank step

190f1e3

TL/CUDA: changed logs to debug lvl

d76629c

TL/CUDA: minor cleanups

e444cda

TL/CUDA: addressed comments

1ca8866

TL/CUDA: removed done stage

2af5ea5

TL/CUDA: added unit test

38489b3

TL/CUDA: addressed comments

b9272bc

TL/CUDA: fix formatting

2c7ae32

TL/CUDA: fixed compilation

5b8c8c7

TL/CUDA: fix include

87c4424

ikryukov force-pushed the cuda_bcast branch from 8b8720c to 87c4424 on October 28, 2024 10:36

ikryukov added 9 commits October 28, 2024 11:48

TL/CUDA: removed returns

60ea289

TL/CUDA: active set support

a4887f7

TL/CUDA: fix build

082c643

TL/CUDA: fixed comments

168d2e1

TL/CUDA: select free bar using atomic

8a7adc4

TL/CUDA: fix

f2404db

TL/CUDA: replace free tag

f4f1e5a

TL/CUDA: fix bar tag init val

e9c1abe

TL/CUDA: added tag print

2e808fb
